
    Scalable Recommendation with Poisson Factorization

    We develop a Bayesian Poisson matrix factorization model for forming recommendations from sparse user behavior data. These data are large user/item matrices where each user has provided feedback on only a small subset of items, either explicitly (e.g., through star ratings) or implicitly (e.g., through views or purchases). In contrast to traditional matrix factorization approaches, Poisson factorization implicitly models each user's limited attention to consume items. Moreover, because of the mathematical form of the Poisson likelihood, the model needs only to explicitly consider the observed entries in the matrix, leading to both scalable computation and good predictive performance. We develop a variational inference algorithm for approximate posterior inference that scales up to massive data sets. This is an efficient algorithm that iterates over the observed entries and adjusts an approximate posterior over the user/item representations. We apply our method to large real-world user data containing users rating movies, users listening to songs, and users reading scientific papers. In all these settings, Bayesian Poisson factorization outperforms state-of-the-art matrix factorization methods.
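
    As a rough illustration of why the Poisson likelihood scales, here is a minimal NumPy sketch of the generative model and its log-likelihood; all names, shapes, and hyperparameters are illustrative, not the paper's reference implementation.

        import numpy as np

        rng = np.random.default_rng(0)
        n_users, n_items, k = 1000, 500, 20

        # Gamma priors give sparse, non-negative representations and capture
        # heavy-tailed user activity.
        theta = rng.gamma(shape=0.3, scale=1.0, size=(n_users, k))  # user preferences
        beta = rng.gamma(shape=0.3, scale=1.0, size=(n_items, k))   # item attributes

        # Each count y_ui is Poisson with rate theta_u . beta_i.
        y = rng.poisson(theta @ beta.T)

        # Why only observed entries matter: when y_ui = 0 the data term vanishes,
        # and the rate term factorizes so no pass over the zeros is needed:
        # sum_ui theta_u . beta_i = (sum_u theta_u) . (sum_i beta_i).
        users, items = np.nonzero(y)
        rates = np.einsum("nk,nk->n", theta[users], beta[items])
        loglik = y[users, items] @ np.log(rates) - theta.sum(0) @ beta.sum(0)
        # (up to the constant sum_ui log(y_ui!), which involves neither theta nor beta)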

    Ask the GRU: Multi-Task Learning for Deep Text Recommendations

    In a variety of application domains the content to be recommended to users is associated with text. This includes research papers, movies with associated plot summaries, news articles, blog posts, etc. Recommendation approaches based on latent factor models can be extended naturally to leverage text by employing an explicit mapping from text to factors. This enables recommendations for new, unseen content, and may generalize better, since the factors for all items are produced by a compactly-parametrized model. Previous work has used topic models or averages of word embeddings for this mapping. In this paper we present a method leveraging deep recurrent neural networks to encode the text sequence into a latent vector, specifically gated recurrent units (GRUs) trained end-to-end on the collaborative filtering task. For the task of scientific paper recommendation, this yields models with significantly higher accuracy. In cold-start scenarios, we beat the previous state-of-the-art approaches, all of which ignore word order. Performance is further improved by multi-task learning, where the text encoder network is trained for a combination of content recommendation and item metadata prediction. This regularizes the collaborative filtering model, ameliorating the problem of sparsity of the observed rating matrix.
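
    A minimal PyTorch-style sketch of the idea, under assumed names and dimensions (TextToFactors, tag_head, and all sizes are hypothetical): a GRU encodes an item's text into its latent factors, and a secondary metadata head provides the multi-task regularization.

        import torch
        import torch.nn as nn

        class TextToFactors(nn.Module):
            """Map an item's token sequence to its latent factors (illustrative)."""
            def __init__(self, vocab_size=20000, embed_dim=128, factor_dim=64, n_tags=50):
                super().__init__()
                self.embed = nn.Embedding(vocab_size, embed_dim)
                self.gru = nn.GRU(embed_dim, factor_dim, batch_first=True)
                # Multi-task head: predict item metadata (e.g., tags) from the
                # same encoding; its loss regularizes the recommender.
                self.tag_head = nn.Linear(factor_dim, n_tags)

            def forward(self, token_ids):             # token_ids: (batch, seq_len)
                states, _ = self.gru(self.embed(token_ids))
                item_factors = states[:, -1]           # final hidden state
                return item_factors, self.tag_head(item_factors)

        # Scoring a user against a new (cold-start) item needs only its text:
        # score(u, i) = user_factors[u] . encode(text_i).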

    On Sampling Strategies for Neural Network-based Collaborative Filtering

    Recent advances in neural networks have inspired people to design hybrid recommendation algorithms that incorporate both (1) user-item interaction information and (2) content information including image, audio, and text. Despite their promising results, neural network-based recommendation algorithms incur substantial computational costs, making them challenging to scale and improve upon. In this paper, we propose a general neural network-based recommendation framework, which subsumes several existing state-of-the-art recommendation algorithms, and address the efficiency issue by investigating sampling strategies for the stochastic gradient descent training of the framework. We tackle this issue by first establishing a connection between the loss functions and the user-item interaction bipartite graph, where the loss function terms are defined on links while the major computational burdens are located at nodes. We call this type of loss function "graph-based", and for such losses different mini-batch sampling strategies can have very different computational costs. Based on this insight, we propose three novel sampling strategies that significantly improve the training efficiency of the proposed framework (up to a 30x speedup in our experiments) while also improving recommendation performance. Theoretical analysis is provided for both the computational cost and the convergence. We believe the study of sampling strategies has further implications for general graph-based loss functions, and will also enable more research under the neural network-based recommendation framework.
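
    One hedged sketch of the node-centric intuition (not the paper's actual samplers, whose details are not reproduced here): group sampled links by user so that each expensive node computation is amortized across several loss terms.

        import numpy as np

        rng = np.random.default_rng(0)

        # Toy interaction lists: user -> array of item ids the user interacted with.
        interactions = [rng.choice(500, size=rng.integers(3, 20), replace=False)
                        for _ in range(1000)]

        def node_centric_minibatch(interactions, batch_users=64, pos_per_user=4):
            """Sample links grouped by user so one user encoding serves several
            loss terms; a uniform sampler over links would instead touch roughly
            one distinct user per link (illustrative)."""
            users = rng.choice(len(interactions), size=batch_users, replace=False)
            batch = []
            for u in users:
                items = interactions[u]
                n = min(pos_per_user, len(items))
                batch.extend((u, i) for i in rng.choice(items, size=n, replace=False))
            return batch

        links = node_centric_minibatch(interactions)  # ~256 loss terms, 64 user encodings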

    Characterizing and predicting repeat food consumption behavior for just-in-time interventions

    National Research Foundation (NRF) Singapore under its International Research Centres in Singapore Funding Initiative.

    Nutrigenomics: future for sustenance

    Nutrigenomics deals with the effect of foods and food constituents on gene expression and is a new concept in disease prevention and cure. Nutrigenomics describes how nutrients influence the body to express genes, whereas nutrigenetics refers to how the body responds to nutrients. Various bioactive food components can alter gene-expression mechanisms, but our current knowledge is so incomplete that such information serves mainly to stimulate the imagination. If science could arrive at more precise facts, these would have vast applications in medicine.

    Impact of COVID-19 on cardiovascular testing in the United States versus the rest of the world

    Objectives: This study sought to quantify and compare the decline in volumes of cardiovascular procedures between the United States and non-US institutions during the early phase of the coronavirus disease-2019 (COVID-19) pandemic. Background: The COVID-19 pandemic has disrupted the care of many non-COVID-19 illnesses. Reductions in diagnostic cardiovascular testing around the world have led to concerns over the implications of reduced testing for cardiovascular disease (CVD) morbidity and mortality. Methods: Data were submitted to the INCAPS-COVID (International Atomic Energy Agency Non-Invasive Cardiology Protocols Study of COVID-19), a multinational registry comprising 909 institutions in 108 countries (including 155 facilities in 40 U.S. states), assessing the impact of the COVID-19 pandemic on volumes of diagnostic cardiovascular procedures. Data were obtained for April 2020 and compared with baseline procedure volumes from March 2019. We compared laboratory characteristics, practices, and procedure volumes between U.S. and non-U.S. facilities and between U.S. geographic regions and identified factors associated with volume reduction in the United States. Results: Reductions in the volumes of procedures in the United States were similar to those in non-U.S. facilities (68% vs. 63%, respectively; p = 0.237), although U.S. facilities reported greater reductions in invasive coronary angiography (69% vs. 53%, respectively; p < 0.001). Significantly more U.S. facilities than non-U.S. facilities reported increased use of telehealth and of patient screening measures such as temperature checks, symptom screening, and COVID-19 testing. Reductions in volumes of procedures differed between U.S. regions, with larger declines observed in the Northeast (76%) and Midwest (74%) than in the South (62%) and West (44%). Prevalence of COVID-19, staff redeployments, outpatient centers, and urban centers were associated with greater volume reductions in U.S. facilities in a multivariable analysis. Conclusions: We observed marked reductions in U.S. cardiovascular testing in the early phase of the pandemic and significant variability between U.S. regions. The association between volume reductions and COVID-19 prevalence in the United States highlights the need for proactive efforts to maintain access to cardiovascular testing in areas most affected by outbreaks of COVID-19 infection.
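
    The headline figures are simple percent declines against the March 2019 baseline month; a toy computation with made-up counts:

        def percent_reduction(baseline, current):
            """Percent decline in procedure volume relative to the baseline month."""
            return 100.0 * (baseline - current) / baseline

        # Illustrative counts only, not the registry's data:
        march_2019, april_2020 = 1200, 384
        print(f"{percent_reduction(march_2019, april_2020):.0f}% reduction")  # 68% reduction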

    Scalable inference of discrete data: user behavior, networks and genetic variation

    Recent years have seen explosive growth in data, models and computation. Massive data sets and sophisticated probabilistic models are increasingly used in the fields of high-energy physics, biology, genetics and in personalization applications; however, many statistical algorithms remain inefficient, impeding scientific progress. In this thesis, we present several efficient statistical algorithms for learning from massive discrete data sets. We focus on discrete data because complex and structured activity such as chromosome folding in three dimensions, human genetic variation, social network interactions and product ratings are often encoded as simple matrices of discrete numerical observations. Our algorithms derive from a Bayesian perspective and lie in the framework of directed graphical models and mean-field variational inference. Situated in this framework, we gain computational and statistical efficiency through modeling insights and through subsampling informative data during inference. We begin with additive Poisson factorization models for recommending items to users based on user consumption or ratings. These models provide sparse latent representations of users and items, and capture the long-tailed distributions of user consumption. We use them as building blocks for article recommendation models by sharing latent spaces across readership and article text. We demonstrate that our algorithms scale to massive data sets, are easy to implement and provide competitive user recommendations. Then, we develop a Bayesian nonparametric model in which the latent representations of users and items grow to accommodate new data. In the second part of the thesis, we develop novel algorithms for discovering overlapping communities in large networks. These algorithms interleave non-uniform subsampling of the network with model estimation. Our network models capture the basic ways in which nodes connect to each other, through similarity and popularity, using mixed-membership representations and a generalized linear model formulation. Finally, we present the TeraStructure algorithm to fit Bayesian models of genetic variation in human populations on tera-sample-sized data sets (10^12 observed genotypes, e.g., 1M individuals at 1M SNPs). On real genomic data collected from thousands of individuals, TeraStructure is faster than existing methods and recovers the latent population structure with equal accuracy. On genomic data simulated at tera-sample scale, TeraStructure is highly accurate and is the only method that can complete its analysis.
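
    The common computational skeleton across these models is stochastic variational inference: subsample data, rescale the statistics to keep estimates unbiased, and take a decaying step toward the noisy optimum. A generic sketch, with all names, shapes and distributions illustrative rather than taken from the thesis:

        import numpy as np

        rng = np.random.default_rng(0)

        # The per-datum statistics here are placeholders; in practice they come
        # from factorization, network, or genotype models.
        N, batch, k = 1_000_000, 1_000, 10   # N = total observations (e.g., genotypes)
        prior = 0.1
        lam = np.ones(k)                     # global variational parameter

        def local_statistics(idx):
            """Stand-in for the mean per-datum sufficient statistic."""
            return rng.gamma(1.0, 1.0, size=(len(idx), k)).mean(axis=0)

        for t in range(100):
            idx = rng.choice(N, size=batch, replace=False)   # subsample the data
            lam_hat = prior + N * local_statistics(idx)      # rescaled noisy optimum
            rho = (t + 16) ** -0.7                           # Robbins-Monro step size
            lam = (1 - rho) * lam + rho * lam_hat            # stochastic natural-gradient step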